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Abstract 

An ultrametric topology formalizes the notion of hierarchical struc- 
ture. An ultrametric embedding, referred to here as ultrametricity, is 
implied by a natural hierarchical embedding. Such hierarchical structure 
can be global in the data set, or local. By quantifying extent or degree 
of ultrametricity in a data set, we show that ultrametricity becomes per- 
vasive as dimensionality and/or spatial sparsity increases. This leads us 
to assert that very high dimensional data are of simple structure. We 
exemplify this finding through a range of simulated data cases. We dis- 
cuss also application to very high frequency time series segmentation and 
modeling. 

PACS: 02.50.-r, 05.45.Tp, 89.65.Gh, 89.20.-a 

1 Introduction 

The topology or inherent shape and form of an object is important. In data 
analysis, the inherent form and structure of data clouds are important. Quite 
a few models of data form and structure are used in data analysis. One of 
them is a hierarchically embedded set of clusters, - a hierarchy. It is traditional 
(since at least the 1960s) to impose such a form on data, and if useful to assess 
the goodness of fit. Rather than fitting a hierarchical structure to data (e.g., 
[23] 1. our recent work has taken a different orientation: we seek to find (partial or 
global) inherent hierarchical structure in data. As wc will describe in this article, 
there are interesting findings that result from this, and some very interesting 
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perspectives are opened up for data analysis and, potentially, perspectives also 
on the physics (or causal or generative mechanisms) underlying the data. 

A formal definition of hierarchical structure is provided by ultrametric topol- 
ogy (in turn, related closely to p-adic number theory). We will return to this in 
section [5] below. First, though, we will summarize some of our findings. 

Ultrametricity is a pervasive property of observational data. It arises as a 
limit case when data dimensionality or sparsity grows. More strictly such a limit 
case is a regular lattice structure and ultrametricity is one possible represen- 
tation for it. Notwithstanding alternative representations, ultrametricity offers 
computational cfRciency (related to tree depth/height being logarithmic in num- 
ber of terminal nodes), linkage with dynamical or related functional properties 
(phylogenetic interpretation), and processing tools based on well known p-adic 
or ultrametric theory (examples: deriving a partition, or applying an ultramet- 
ric wavelet transform). In }1 1 j and other works, Khrennikov has pointed to the 
importance of ultrametric topological analysis. 

Local ultrametricity is also of importance. This can be used for forensic data 
exploration (fingerprinting data sets): see [T^ and [T^; and to expedite search 
and discovery in information spaces: see [6| as discussed by us in [14], [18], and 

m- 

In section [2] we show how extent of ultrametricity is measured. Section [3] 
presents our main results on the remarkable properties of very high dimensional, 
or very sparse, spaces. As dimensionality or sparsity grow, so does the inherent 
hierarchical nature of the data in the space. In section 14.21 we then discuss 
application to very high frequency time series modeling. 

2 Quantifying Degree of Ultrametricity 

Summarizing a full description in Murtagh [14| we explored two measures quan- 
tifying how ultrametric a data set is, - Lerman's and a new approach based on 
triangle invariance (respectively, the second and third approaches described in 
this section). 

The triangular inequality holds for a metric space: d{x, z) < d{x, y) + d{y, z) 
for any triplet of points x,y,z. In addition the properties of symmetry and 
positive definiteness are respected. The "strong triangular inequality" or ul- 
trametric inequality is: d{x,z) < max {d{x,y),d{y, z)} for any triplet x,y,z. 
An ultrametric space implies respect for a range of stringent properties. For 
example, the triangle formed by any triplet is necessarily isosceles, with the two 
large sides equal; or is equilateral. 

• Firstly, Rammal et al. [22] used discrepancy between each pairwise dis- 
tance and the corresponding subdominant ultrametric. Now, the subdom- 
inant ultrametric is also known as the ultrametric distance resulting from 
the single linkage agglomerative hierarchical clustering method. Closely 
related graph structures include the minimal spanning tree, and graph 
(connected) components. While the subdominant provides a good fit to 
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the given distance (or indeed dissimilarity), it suffers from tlie "friends of 
friends" or chaining effect. 

• Secondly, Lerman [12] developed a measure of ultrametricity, termed H- 
classifiability, using ranks of all pairwise given distances (or dissimilari- 
ties). The isosceles (with small base) or equilateral requirements of the 
ultrametric inequality impose constraints on the ranks. The interval be- 
tween median and maximum rank of every set of triplets must be empty 
for ultrametricity. We have used extensively Lerman's measure of degree 
of ultrametricity in a data set. Taking ranks provides scale invariance. 
But the limitation of Lerman's approach, we find, is that it is not reason- 
able to study ranks of real-valued (values in non-negative reals) distances 
defined on a large set of points. 

• Thirdly, our own measure of extent of ultrametricity [14J can be described 
algorithmically. We examine triplets of points (exhaustively if possible, or 
otherwise through sampling), and determine the three angles formed by 
the associated triangle. We select the smallest angle formed by the triplet 
points. Then we check if the other two remaining angles are approximately 
equal. If they are equal then our triangle is isosceles with small base, or 
equilateral (when all triangles are equal). The approximation to equality 
is given by 2 degrees (0.0349 radians). Our motivation for the approximate 
( "fuzzy" ) equality is that it makes our approach robust and independent 
of measurement precision. 

A supposition for use of our measure of ultrametricity is that we can can 
define angles (and hence triangle properties). This in turn presupposes a scalar 
product. Thus we presuppose a complete normed vector space with a scalar 
product - a Hilbert space - to provide our needed environment. 

Quite a general way to embed data, to be analyzed, in a Euclidean space, 
is to use correspondence analysis |17j . This explains our interest in using cor- 
respondence analysis: it provides a convenient and versatile way to take input 
data in many varied formats (e.g., ranks or scores, presence/absence, frequency 
of occurrence, and many other forms of data) and map them into a Euclidean, 
factor space. 

3 Ultrametricity and Dimensionality 

3.1 Distance Properties in Very Sparse Spaces 

Murtagh [M] , and earlier work by Rammal et al. [HI [22] , has demonstrated 
the pervasiveness of ultrametricity, by focusing on the fact that sparse high- 
dimensional data tend to be ultrametric. In such work it is shown how numbers 
of points in our clouds of data points are irrelevant; but what counts is the am- 
bient spatial dimensionality. Among cases looked at are statistically uniformly 
(hence "unclustered" , or without structure in a certain sense) distributed points. 
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and statistically uniformly distributed hypercube vertices (so the latter are ran- 
dom 0/1 valued vectors). Using our ultrametricity measure, there is a clear 
tendency to ultrametricity as the spatial dimensionality (hence spatial sparse- 
ness) increases. 

As |H] also show, Gaussian data behave in the same way and a demonstration 
of this is seen in Table [T] To provide an idea of consensus of these results, 
the 200,000-dimensional Gaussian was repeated and yielded on successive runs 
values of the ultrametricity measure of: 0.96, 0.98, 0.96. 

In the following, we explain why high dimensional and/or sparsely populated 
spaces are ultrametric. 

As dimensionality grows, so too do distances (or indeed dissimilarities, if 
they do not satisfy the triangular inequality). The least change possible for 
dissimilarities to become distances has been formulated in terms of the smallest 
additive constant needed, to be added to all dissimilarities [211 HI 13 EO] . Adding 
a sufficiently large constant to all dissimilarities transforms them into a set 
of distances. Through addition of a larger constant, it follows that distances 
become approximately equal, thus verifying a trivial case of the ultrametric or 
"strong triangular" inequality. Adding to dissimilarities or distances may be a 
direct consequence of increased dimensionality. 

For a close fit or good approximation, the situation is not as simple for tak- 
ing dissimilarities, or distances, into ultrametric distances. A best fit solution 
is given by [7] (and software is available in R [TO])- If we want a close fit to the 
given dissimilarities then a good choice would avail either of the maximal infe- 
rior, or subdominant, ultrametric; or the minimal superior ultrametric. Stepwise 
algorithms for these are commonly known as, respectively, single linkage hierar- 
chical clustering; and complete link hierarchical clustering. (See [3l [121 US] and 
other texts on hierarchical clustering.) 

3.2 No "Curse of Dimensionality" in Very High Dimen- 
sions 

Bellman's [5] "curse of dimensionality" relates to exponential growth of hy- 
pervolume as a function of dimensionality. Problems become tougher as di- 
mensionality increases. In particular problems related to proximity search in 
high-dimensional spaces tend to become intractable. 

In a way, a "trivial limit" (Treves [23) case is reached as dimensionality in- 
creases. This makes high dimensional proximity search very different, and given 
an appropriate data structure ~ such as a binary hierarchical clustering tree - 
we can find nearest neighbors in worst case 0(1) or constant computational time 
[T^ . The proof is simple: the tree data structure affords a constant number of 
edge traversals. 

The fact that limit properties are "trivial" makes them no less interesting 
to study. Let us refer to such "trivial" properties as (structural or geometrical) 
regularity properties (e.g. all points lie on a regular lattice). 

First of all, the symmetries of regular structures in our data may be of 
importance. For example, processing of such data can exploit these regularities. 
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No. points Dimen. Isosc. Equil. UM 



Uniform 



100 


20 


0. 


,10 


0.03 


0.13 


100 


200 


0. 


,16 


0.20 


0.36 


100 


2000 


0. 


,01 


0.83 


0.84 


100 


20000 







0.94 


0.94 


100 


200000 







0.97 


0.97 


Hypercube 












100 


20 


0. 


,14 


0.02 


0.16 


100 


200 


0. 


,16 


0.21 


0.36 


100 


2000 


0. 


,01 


0.86 


0.87 


100 


20000 







0.96 


0.96 


100 


200000 







0.97 


0.97 


Gaussian 












100 


20 


0. 


,12 


0.01 


0.13 


100 


200 


0. 


,23 


0.14 


0.36 


100 


2000 


0. 


,04 


0.77 


0.80 


100 


20000 







0.98 


0.98 


100 


200000 







0.96 


0.96 



Table 1: Typical results, based on 300 sampled triangles from triplets of 
points. For uniform, the data are generated on [0, 1]; hypercube vertices are in 
{0, ijDimen^ ^j. Gaussian, the data are of mean 0, and variance 1. Dimen. 
is the ambient dimensionality. Isosc. is the number of isosceles triangles with 
small base, as a proportion of all triangles sampled. Equil. is the number of 
equilateral triangles as a proportion of triangles sampled. UM is the proportion 
of ultrametricity-respecting triangles (= 1 for all ultrametric) . 
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Secondly, "islands" or clusters in our data, where each "island" is of regular 
structure, may be of interpretational value. 

Fourthly, and finally, regularity of particular properties does not imply reg- 
ularity of all properties. So, for example, we may have only partial existence of 
pairwise linkages. 

Thus we see that in very high dimensions, and/or in very (spatially) sparse 
data clouds, there is a simplification of structure, which can be used to mitigate 
any "curse of dimensionality" . Figure [T] shows how the distances within and 
between clusters become tighter with increase in dimensionality. 

3.3 Gaussian Clusters in Very High Dimensions 

3.3.1 Introduction 

We will distinguish between cluster characteristics as follows: 

1. cluster size: number of points per cluster; 

2. cluster location: here, mean, identical on every dimension; 

3. cluster scale: here, standard deviation, identical on every dimension. 

These cluster characteristics are simple ones, and future work will consider 
greater sophistication. 

In the homogeneous clouds studied in Table [1] it is seen that the isosceles 
(with small base) case disappeared early on, as dimensionality increased greatly, 
to the advantage of the equilateral case of ultrametricity. So the points become 
increasingly equilateral-related as dimensionality grows. This is not the case 
when the data in clustered, as we will now see. 

3.3.2 Clusters with Different Locations, Same Scale 

Table [5] is based on two clusters, and shows how isosceles triangles increasingly 
dominate as dimensionality grows. Figure [2] illustrates low and high dimension- 
ality scenarios relating to Table [H There is clear confirmation in this table as 
to how interrelationships in the cluster become more compact and, in a certain 
sense, more trivial, in high dimensions. This does not obscure the fact that 
we indeed have hierarchial relationships becoming ever more pronounced as di- 
mensionality, and hence relative sparsity, increase. These observations help us 
to see quite clearly just how hierachical relationships come about, as ambient 
dimensionality grows. 

3.3.3 Clusters with Different Locations, Different Scales 

A more demanding case study is now tried. We generate 50 points per cluster 
with the following characteristics: mean 0, standard deviation 1, on each di- 
mension; mean 3, standard deviation 2, on each dimension; mean 5, standard 
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Figure 1 : An illustration of how "symmetry" or "structure" can become increas- 
ingly pronounced as dimensionality increases. Shown are two simulations, each 
with 3 sub-populations of Gaussian-distributed data, in, respectively, ambient 
dimensions of 2000 and 20,000. These simulations correspond to the 3rd last, 
and 2nd last, rows of Table [TJ 
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Figure 2: A further illustration of how "symmetry" or "structure" can become 
increasingly pronounced as dimensionality increases, relating to the 200 x 20 
and 200 x 20, 000 (first of the succession of rows) cases of Table [2] These are 
histograms of all interpoint distances, based on two Gaussian clusters. The first 
has mean and standard deviation 1 on all dimensions. The second has mean 
10 and standard deviation 1 on all dimensions. 
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Isosc 


Ec[i_iil 


UM 


200 


20 


0.08 





0.08 


200 


200 


0.19 


0.04 


0.23 


200 


2000 


0.42 


0.20 


0.62 


200 


20000 


0.74 


0.22 


0.96 


200 


20000 


0.7 


0.28 


0.98 


200 


20000 


0.77 


0.21 


0.98 


200 


20000 
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0.21 


0.98 


200 


20000 


0.75 


0.24 


0.99 


200 


20000 


0.73 


0.25 


0.98 



Table 2: Results based on 300 sampled triangles from triplets of points. Two 
Gaussian clusters, each of 100 points, were used in each case. One point set 
was of mean 0, and the other of mean 10, on each dimension. The standard 
deviations on each dimension were 1 in all cases. Column headings are as in 
Table [TJ Five further results are given for the 20,000-dimension case to show 
variability. 

deviation 1, on each dimension; and mean 8, standard deviation 3, on each di- 
mension. Table [3] shows the results obtained. Here we have not achieved quite 
the same level of ultrametricty, due to slower growth in ultrametricity which is, 
in turn, due to the more murky, less dermarcated, but undoubtdely clustered, 
set of data. Figure [3] illustrates this: this histogram shows one dimension, where 
we note that means of the Gaussians are at 0, 3, 5 and 8. 

When we look closer at Table [3l as shown in Figure [H the compaction of 
distances is again very interesting. We verified the 7 peaks found in the lower 
histogram in Figure IH and available but confusedly overlapping and ill-defined 
in the upper histogram of Figure 

What we find for the 7 peaks is as follows. Distances within the clusters 



No. points 


Dimen. 


Isosc. 


Equil. 


UM 


200 
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0.01 
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0.06 


0.34 
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0.08 


0.58 
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0.55 


0.11 


0.66 



Table 3: Results based on 300 sampled triangles from triplets of points. Four 
Gaussian clusters, each of 50 points, were used in each case. See text for details 
of properties of these clusters. 
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Figure 3: A projection onto one dimension, to illustrate the less than clearcut 
clustering problem addressed. There are four Gaussians here, each of 50 real- 
izations, with means at 0, 3, 5 and 8, and with respective standard deviations 
of 1, 2, 1, 3. 



10 



correspond to: peaks 1, 2, 3 and (again) 1. That two clusters are associated 
with one peak is clear from the fact that two of our clusters are of identical 
scale. 

We can examine inter-cluster distances and we found these to be associated 
with peaks: 2, 3, 4, 5, 6, 7. Given 4 clusters, we could well have up to 6 possible 
additional peaks. 

3.3.4 Conclusions on High Dimensional Gaussian Clouds 

From these case studies, it is clear that increased dimensionality sharpens and 
distinguishes the clusters. If we can embed data ~ any data - in a far higher 
ambient dimensionality, without destroying the interpretable relationships in 
the data, then we can so much more easily read off the clusters. 

To read off clusters, including memberships and properties, our findings can 
be summarized as follows. 

For cluster size (i.e., numbers of points per cluster), sampling alone can be 
used, and we do not pursue this here. 

For cluster scale (i.e., standard deviation, assumed the same on each di- 
mension), we associate each cluster, or a pair of clusters, with each peak. The 
total number of peaks gives an upper bound on the number of clusters. (For k 
clusters, we have < k + k ■ {k — l)/2 peaks.) 

Using cluster scale also permits use of the following cluster model: suppose 
that all clusters are defined to have intra-cluster distance that is less than inter- 
cluster distance. Then it follows that the peaks of lower distance correspond to 
the clusters (as opposed to pairs of clusters). 

An example of this is as follows. In Figure IH lower panel, we read from left 
to right, applying the following algorithm: select the first k peaks as clusters, 
and ask: are there sufficient peaks to represent all inter-cluster pairs? If we 
choose k = 3, there remain 4 peaks, which is too many to account for the inter- 
cluster pairs (i.e., 3 • (3 — l)/2)). So we see that Figure [4] is incompatible with 
fc = 3 or the presence of just 3 clusters. 

Consequently we move to k — A, and see that Figure S] is consistent with 
this. 

A further identifiability assumption is reasonable albeit not required: that 
all smallest peaks be associated with intra-cluster distances. This need not be 
so, since we could well have a dense cluster superimposed on a less dense one. 
However it is a reasonable parimony assumption. Supported by this assumption. 
Figure m points to a minimum of 4 clusters in the data, with up to 4 peaks (read 
off from left to right, i.e., in increasing order of distance) corresponding to these 
clusters. 
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Figure 4: Compaction of distances with rise in dimensionality: 4 clusters, sub- 
stantially overlapping are the basis for the histograms of all pairwise distances. 
Top: ambient dimensionality 20. Bottom: ambient dimensionality 200,000. 
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4 Applications 



4.1 Data Recoding in the Correspondence Analysis Tra- 
dition 

The iris data has been very widely used as a toy data set since Fisher used it 
in 1936 ([8], taken from T) to exemphfy discriminant analysis. It consists of 
150 iris flowers, each characterized by 4 petal and sepal, width and breadth, 
measurements. On the one hand, therefore, we have the 150 irises in R"*. Next, 
each variable value was recoded by us to be a rank (all ranks of a given variable 
considered) and the rank was boolean-coded (viz., for the top rank variable 
value, 1000..., for the second rank variable value, 0100..., etc.). Following 
removal of zero total columns, the second data set defined the 150 irises in R^^^. 
Actually, this definition of the 150 irises is in fact in {0, 1}^^^. 

Our triangle-based measure of the degree of ultrametricity in a data set 
(here the set of irises), with = no ultrametricity, and 1 = every triangle an 
ultrametric-respecting one, gave the following: for irises in R^, 0.017; and for 
irises in {0, Ij^^s. 9.948. 

This provides a nice illustration of how recoding can dramatically change 
the picture provided by one's data. In chapter 3 of [17| it is discussed just 
what change in the data cloud is caused by the recoding. Our objective here 
is not to pursue the goodness of fit or otherwise of one data encoding vis-a-vis 
another. Instead our objective is to point out how data encoding influences 
directly (and at times remarkably) the data cloud's ultrametricity, or ease of 
being hierarchically embedded. 

In correspondence analysis, the ^ distance when used on data tables with 
constant marginal sums becomes a weighted Euclidean distance. This is im- 
portant for us as data analyst, because it means that we can directly influence 
the analysis by equi-weighting, say, the table rows in the following way: we 
double the row vector values by including an absence (0 value) whenever there 
is a presence (1 value) and vice versa. Or for a table of percentages, we take 
both the original value x and 100 — x. In the correspondence analysis tradition 
[3l [TT] this is known as doubling (dedoublement) . More generally, booleanizing, 
or making qualitative, data in this way, for a varying (value-dependent) number 
of target value categories (or modalities) leads to the form of coding known as 
complete disjunctive form. 

Such coding increases the embedding dimension, and data sparseness. From 
our example of recoding the Fisher data, such coding can influence degree of 
ultrametricity. We conclude that careful data coding can increase the extent to 
which our data is inherently hierarchical. Furthermore the latter in turn may 
be beneficial in enhancing data interpretability (for example, by unravelling 
phylogenetic aspects expressed by the data). 
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No. time series Dimcn. Isosc. Equil. UM 



100 
100 
100 



2000 0.17 
20000 0.15 
200000 0.03 



0.32 

0.5 

0.57 



0.49 
0.65 
0.60 



Table 4: Results based on 300 sampled triangles from triplets of points. Two 
sets of the ARIMA models are used, each of 50 realizations. 

4.2 Application to High Frequence Data Analysis 

In this section we establish proof of concept for application of the foregoing 
work to analysis of very high frequency univariate time series signals. 

Consider each of the cases considered in section [?751 expressed there as n x m 
arrays, as instead representing n segments, each of (contiguous) length m, of 
a time series or one-dimensional signal. Assuming our aim is to cluster these 
segments on the basis of their properties, then it is reasonable to require that 
they be non-overlapping. The n segments could come from anywhere, in any 
order, in the time series. So for the case of an n x to array considered previously, 
then implies a time series of length at least nm. The most immediate way to 
construct the time series is to raster scan the n x m array, although alternatives 
come readily to mind. 

The methodology discussed in section [3731 then is seen to be also a time series 
segmentation approach, facilitating the characterizing of the segments used. 

To explore this further we consider a time series consisting of two ARIMA 
(autoregressive integrated moving average) models, with parameters: order, au- 
toregression coefficients, moving average coefficients, and a "mildly longtailed" 
set of innovations based on the Student t distribution with 5 degrees of freedom. 
Figures [5] and [6] show samples of these time series segments. Figures [7] and [8] 
show histograms of these samples. 

Table m shows typical results obtained in regard to ultrametricity. The di- 
mensionality can be considered as the embedding dimension. Here, although 
ultrametricity increases, and the equilateral configuration seems to be increas- 
ing but with decrease of the isosceles with small base configuration, we do not 
consider it of practical relevance to test with even higher ambient dimensional- 
ities. It is clear from the data, especially Figures [7] and [51 that the two signal 
models are very close in their properties. 

Examining the histograms of all inter-pair time series segments, both intra 
and inter cluster, we find the clearly distinguished peaks shown in Figure [9l 
As before, we use Euclidean distance between time series segments or vectors. 
(We note that normalization or other transformation is not particularly relevant 
here. In fact wc want to distinguish between inter and intra cluster cases. Fur- 
thermore the unweighted Euclidean distance is consistent with our use of angles 
to quantify triangle invariants, and hence respect for ultrametricity properties.) 

We find clearly distinguishable peaks in Figure [9l The lower and the higher 
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Figure 6: Sample (using first 1000 values) of a time series segment, based on the 
second ARIMA set of parameters. (Order 2 AR parameters: 0.2897, -0.1858, 
MA parameters: -0.7279,0.7488.) 
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Figure 7: Histogram of sample (using first 1000 values) of time series segment 
shown in Figure El 
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2nd ARIMA series (1000 values) 
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Figure 8: Histogram of sample (using first 1000 values) of time series segment 
shown in Figure [H 
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100 segments, each of 200k values 
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Figure 9: Histogram of distances from 100 time series segments, using 50 seg- 
ments each from the two ARIMA models, and using an embedding dimension- 
ality of 200,000. 
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peaks belong to the two ARIMA components. The central peak belongs to the 
inter-cluster distances. 

We have shown that our methodology can be of use for time series segmen- 
tation and for model identifiability. We will assess this further in future work. 
Given the use of a Hilbert space as the essential springboard of all aspects of 
this work, it would appear that generalization of this work to multivariate time 
series analysis is straightforward. What remains important, however, is the 
availability of very large embedding dimensionalities, i.e. very high frequency 
data streams. 

5 Conclusions 

What we have observed in all of this work is that in the limit of high dimen- 
sionality a Hilbert space becomes ultrametric. 

It has been our aim in this work to link observed data with an ultrametric 
topology for such data. The traditional approach in data analysis, of course, 
is to impose structure on the data. This is done, for example, by using some 
agglomerative hierarchical clustering algorithm. We can always do this (modulo 
distance or other ties in the data). Then we can assess the degree of fit of 
such a (tree or other) structure to our data. For our purposes, here, this is 
unsatisfactory. 

• Firstly, our aim was to show that ultrametricity can be naturally present in 
our data, globally or locally. We did not want any "measuring tool" such as 
an agglomerative hierarchical clustering algorithm to overly influence this 
finding. (Unfortunately [22j suffers from precisely this unhelpful influence 
of the "measuring tool" of the subdominant ultrametric. In other respects, 
[22] is a seminal paper.) 

• Secondly, let us assume that we did use hierarchical clustering, and then 
based our discussion around the goodness of fit. This again is a traditional 
approach used in data analysis, and in statistical data modeling. But such 
a discussion would have been unnecessary and futile. For, after all, if we 
have ultrametric properties in our data then many of the widely used 
hierarchical clustering algorithms will give precisely the same outcome, 
and furthermore the fit is by definition optimal. 

We have described an application of this work to very high frequency signal 
processing. The twin objectives are signal segmentation, and model identifica- 
tion. We have noted that a considerable amount of this work is model-based: 
we require assumptions (on clusters, and on model(s)) for identifiability. 

Motivation for this work includes the availability of very high frequency 
data streams in various fields (physics, engineering, finance, meteorology, bio- 
engineering, and bio- medicine). By using a very large embedding dimensionality, 
we are approaching the data analysis on a very gross scale, and hence furnishing 
a particular type of multiresolution analysis. That this is worthwhile has been 
shown in our case studies. 
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